community: FAISS vectorstore - consistent Document id field #27101

nhols · 2024-10-04T11:18:40Z

Set id field of documents in a FAISS docstore to be consistent with values in index_to_docstore_id
Implement get_by_ids method for FAISS vectorstore
Add tests
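
The invariant described above can be illustrated with a small in-memory sketch. This is not the FAISS vectorstore code from this PR: `Doc`, `ToyStore`, and `add_texts` here are hypothetical stand-ins, and only the names `index_to_docstore_id` and `get_by_ids` come from the description. The idea is that each stored document's `id` equals the key used in the docstore and the value recorded in `index_to_docstore_id`.

```python
import uuid
from dataclasses import dataclass
from typing import Dict, List, Optional, Sequence


@dataclass
class Doc:
    """Hypothetical stand-in for a document with an optional id field."""
    page_content: str
    id: Optional[str] = None


class ToyStore:
    """Toy store (assumption, not the real FAISS class) that keeps ids consistent."""

    def __init__(self) -> None:
        self.docstore: Dict[str, Doc] = {}
        self.index_to_docstore_id: Dict[int, str] = {}

    def add_texts(self, texts: Sequence[str], ids: Optional[Sequence[str]] = None) -> List[str]:
        # Generate ids when none are supplied, one per text.
        new_ids = list(ids) if ids is not None else [str(uuid.uuid4()) for _ in texts]
        if len(new_ids) != len(texts):
            raise ValueError("ids and texts must have the same length")
        start = len(self.index_to_docstore_id)
        for offset, (text, doc_id) in enumerate(zip(texts, new_ids)):
            # The document's own id field matches the docstore key and the index mapping.
            self.docstore[doc_id] = Doc(page_content=text, id=doc_id)
            self.index_to_docstore_id[start + offset] = doc_id
        return new_ids

    def get_by_ids(self, ids: Sequence[str]) -> List[Doc]:
        # Unknown ids are skipped rather than raising.
        return [self.docstore[i] for i in ids if i in self.docstore]


store = ToyStore()
store.add_texts(["alpha", "beta"], ids=["x", "y"])
found = store.get_by_ids(["y", "missing", "x"])
```

Skipping unknown ids in `get_by_ids` is a design choice of this sketch; the real method may behave differently.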

vercel · 2024-10-04T11:18:45Z

The latest updates on your projects. Learn more about Vercel for Git ↗︎

1 Skipped Deployment

Name	Status	Preview	Comments	Updated (UTC)
langchain	⬜️ Ignored (Inspect)	Visit Preview		Dec 15, 2024 4:58pm

…8274) Ctrl+F to find instances of `langchain-databricks` and replace with `databricks-langchain`. Signed-off-by: Prithvi Kannan <[email protected]>

…ain-ai#28269) - **Description:** Corrected the parameter name in the HuggingFaceEmbeddings documentation under integrations/text_embedding/ from model to model_name to align with the actual code usage in the langchain_huggingface package. - **Issue:** Fixes langchain-ai#28231 - **Dependencies:** None

…ssages` (langchain-ai#28267) We have a test [test_structured_few_shot_examples](https://github.com/langchain-ai/langchain/blob/ad4333ca032033097c663dfe818c5c892c368bd6/libs/standard-tests/langchain_tests/integration_tests/chat_models.py#L546) in standard integration tests that implements a version of tool-calling few shot examples that works with ~all tested providers. The formulation supported by ~all providers is: `human message, tool call, tool message, AI response`. Here we update `langchain_core.utils.function_calling.tool_example_to_messages` to support this formulation. The `tool_example_to_messages` util is undocumented outside of our API reference. IMO, if we are testing that this function works across all providers, it can be helpful to feature it in our guides. The structured few-shot examples we document at the moment require users to implement this function and can be simplified.
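
To make the four-part formulation concrete, here is a minimal sketch that builds the sequence with plain dictionaries. It does not call `tool_example_to_messages` and is not its implementation; the helper name `build_few_shot_messages`, the dictionary keys, and the `call_0` id are all assumptions chosen for illustration.

```python
from typing import Any, Dict, List


def build_few_shot_messages(
    user_input: str,
    tool_name: str,
    tool_args: Dict[str, Any],
    tool_output: str,
    ai_response: str,
) -> List[Dict[str, Any]]:
    """Hypothetical helper: human message, tool call, tool message, AI response."""
    call_id = "call_0"  # illustrative id linking the tool call to its result
    return [
        {"role": "human", "content": user_input},
        {
            "role": "ai",
            "content": "",
            "tool_calls": [{"id": call_id, "name": tool_name, "args": tool_args}],
        },
        {"role": "tool", "content": tool_output, "tool_call_id": call_id},
        {"role": "ai", "content": ai_response},
    ]


example = build_few_shot_messages(
    user_input="The sky is blue.",
    tool_name="ExtractColor",
    tool_args={"color": "blue"},
    tool_output="Color recorded.",
    ai_response="I extracted the color blue.",
)
```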

langchain-ai#28296) **Description:** Currently, the docstring for `LanceDB.__init__()` provides the default value for `mode`, but not the list of valid values. This PR adds that list to the docstring. **Issue:** N/A **Dependencies:** N/A **Twitter handle:** `@metadaddy` [Leaving as a reminder: If no one reviews your PR within a few days, please @-mention one of baskaryan, efriis, eyurtsev, ccurme, vbarda, hwchase17.]

Adds deprecation notices for Neo4j components moving to the `langchain_neo4j` partner package. - Adds deprecation warnings to all Neo4j-related classes and functions that have been migrated to the new `langchain_neo4j` partner package - Updates documentation to reference the new `langchain_neo4j` package instead of `langchain_community`

) JSONparse, in _validate_metadata_func(), checks the consistency of the _metadata_func() function. To do this, it invokes it and makes sure it receives a dictionary in response. However, during the call, it does not respect future calls, as shown on line 100. This generates errors if, for example, the function is like this:

```python
def generate_metadata(json_node: Dict[str, Any], kwargs: Dict[str, Any]) -> Dict[str, Any]:
    return {
        "source": url,
        "row": kwargs["seq_num"],
        "question": json_node.get("question"),
    }


loader = JSONLoader(
    file_path=file_path,
    content_key="answer",
    jq_schema=".[]",
    metadata_func=generate_metadata,
    text_content=False,
)
```

To avoid this, the verification must comply with the specifications. This patch does just that.

---------

Co-authored-by: Eugene Yurtsev <[email protected]>

…i#25375) community: add hybrid search in opensearch

# Langchain OpenSearch Hybrid Search Implementation

## Implementation of Hybrid Search:

I have extended LangChain's OpenSearch integration with hybrid search capabilities. Building on the existing OpenSearchVectorSearch class, I have implemented Hybrid Search functionality, which combines keyword and semantic search. This lets users use OpenSearch's hybrid search features without leaving the LangChain ecosystem. By blending traditional text matching with vector-based similarity, the enhanced class aims to deliver more accurate and contextually relevant results. It is designed to fit into existing LangChain workflows, making it easy for developers to upgrade their search capabilities.

In implementing the hybrid search for OpenSearch within the LangChain framework, I also incorporated filtering capabilities. It's important to note that, according to the OpenSearch hybrid search documentation, only post-filtering is supported for hybrid queries. This means that the filtering is applied after the hybrid search results are obtained, rather than during the initial search process.

**Note:** For the implementation of hybrid search, I strictly followed the official OpenSearch Hybrid search documentation and I took inspiration from https://github.com/AndreasThinks/langchain/tree/feature/opensearch_hybrid_search Thanks Mate!

### Experiments

I conducted a few experiments to verify that the hybrid search implementation is accurate and capable of reproducing the results of both plain keyword search and vector search.

**Experiment 1: Hybrid Search, keyword_weight = 1, vector_weight = 0**

I conducted an experiment to verify the accuracy of my hybrid search implementation by comparing it to a plain keyword search. For this test, I set the keyword_weight to 1 and the vector_weight to 0 in the hybrid search, effectively giving full weight to the keyword component. The results from this hybrid search configuration matched those of a plain keyword search, confirming that my implementation can reproduce keyword-only search results when needed.

While the results were the same, the scores differed between the two methods. This difference is expected because the plain keyword search in OpenSearch uses the BM25 algorithm for scoring, whereas the hybrid search still performs both keyword and vector searches before normalizing the scores, even when the vector component is given zero weight. This experiment validates that my hybrid search solution correctly handles the keyword search component and properly applies the weighting system.

**Experiment 2: Hybrid Search, keyword_weight = 0.0, vector_weight = 1.0**

For experiment 2, I took the inverse approach to further validate my hybrid search implementation. I set the keyword_weight to 0 and the vector_weight to 1, effectively giving full weight to the vector search component (KNN search). I then compared these results with a pure vector search. The results from the hybrid search with these settings exactly matched those from a standalone vector search. This confirms that my implementation reproduces vector search results when configured to do so.

As with the first experiment, the results were identical but the scores differed between the two methods. This difference in scoring is expected and can be attributed to the normalization process in hybrid search, which still considers both components even when one is given zero weight. This experiment further validates that the solution can emulate pure vector search when needed while maintaining the underlying hybrid search structure.

**Experiment 3: Hybrid Search, balanced, keyword_weight = 0.5, vector_weight = 0.5**

For experiment 3, I adopted a balanced approach. I set both the keyword_weight and vector_weight to 0.5, giving equal importance to the keyword-based and vector-based search components, so that the hybrid search considers lexical matches and semantic similarity equally. This balanced approach suits many real-world applications, as it can capture both exact keyword matches and contextually relevant results that might not contain the exact search terms.

Kindly verify the notebook for the experiments conducted!

**Notebook:** https://github.com/karthikbharadhwajKB/Langchain_OpenSearch_Hybrid_search/blob/main/Opensearch_Hybridsearch.ipynb

### Instructions to follow for Performing Hybrid Search:

**Step-1: Instantiating OpenSearchVectorSearch Class:**

```python
import os

from langchain_community.vectorstores import OpenSearchVectorSearch

# `embedding_model` is assumed to be defined elsewhere in the code.
opensearch_vectorstore = OpenSearchVectorSearch(
    index_name=os.getenv("INDEX_NAME"),
    embedding_function=embedding_model,
    opensearch_url=os.getenv("OPENSEARCH_URL"),
    http_auth=(os.getenv("OPENSEARCH_USERNAME"), os.getenv("OPENSEARCH_PASSWORD")),
    use_ssl=False,
    verify_certs=False,
    ssl_assert_hostname=False,
    ssl_show_warn=False,
)
```

**Parameters:**

1. **index_name:** The name of the OpenSearch index to use.
2. **embedding_function:** The function or model used to generate embeddings for the documents. It's assumed that embedding_model is defined elsewhere in the code.
3. **opensearch_url:** The URL of the OpenSearch instance.
4. **http_auth:** A tuple containing the username and password for authentication.
5. **use_ssl:** Set to False, indicating that the connection to OpenSearch is not using SSL/TLS encryption.
6. **verify_certs:** Set to False, which means the SSL certificates are not being verified. This is often used in development environments but is not recommended for production.
7. **ssl_assert_hostname:** Set to False, disabling hostname verification in SSL certificates.
8. **ssl_show_warn:** Set to False, suppressing SSL-related warnings.

**Step-2: Configure Search Pipeline:**

To use hybrid search, you need to configure a search pipeline first.

**Implementation Details:**

This method configures a search pipeline in OpenSearch that:

1. Normalizes the scores from both keyword and vector searches using the min-max technique.
2. Applies the specified weights to the normalized scores.
3. Calculates the final score using an arithmetic mean of the weighted, normalized scores.

**Parameters:**

* **pipeline_name (str):** A unique identifier for the search pipeline. It's recommended to use a descriptive name that indicates the weights used for keyword and vector searches.
* **keyword_weight (float):** The weight assigned to the keyword search component. This should be a float value between 0 and 1. In this example, 0.3 gives 30% importance to traditional text matching.
* **vector_weight (float):** The weight assigned to the vector search component. This should be a float value between 0 and 1. In this example, 0.7 gives 70% importance to semantic similarity.

```python
opensearch_vectorstore.configure_search_pipelines(
    pipeline_name="search_pipeline_keyword_0.3_vector_0.7",
    keyword_weight=0.3,
    vector_weight=0.7,
)
```

**Step-3: Performing Hybrid Search:**

After creating the search pipeline, you can perform a hybrid search using the `similarity_search()` method or any other search method supported by `langchain`. This method combines keyword-based and semantic similarity searches on your OpenSearch index, using both traditional information retrieval and vector embedding techniques.

**Parameters:**

* **query:** The search query string.
* **k:** The number of top results to return (in this case, 3).
* **search_type:** Set to `hybrid_search` to use both keyword and vector search capabilities.
* **search_pipeline:** The name of the previously created search pipeline.

```python
query = "what are the country named in our database?"
top_k = 3
pipeline_name = "search_pipeline_keyword_0.3_vector_0.7"

matched_docs = opensearch_vectorstore.similarity_search_with_score(
    query=query,
    k=top_k,
    search_type="hybrid_search",
    search_pipeline=pipeline_name,
)
matched_docs
```

twitter handle: @iamkarthik98

---------

Co-authored-by: Karthik Kolluri <[email protected]>
Co-authored-by: Eugene Yurtsev <[email protected]>
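
The scoring scheme described in Step-2 (min-max normalization, weighting, arithmetic mean) can be sketched in plain Python. This is an illustration of the idea only, not OpenSearch's implementation and not the code in this PR; the function names and the sample scores are made up. It also shows why Experiment 1 yields the same ranking as a plain keyword search but different score values.

```python
from typing import Dict


def min_max_normalize(scores: Dict[str, float]) -> Dict[str, float]:
    """Rescale scores to [0, 1]; if all scores are equal, map them to 1.0."""
    if not scores:
        return {}
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        return {doc_id: 1.0 for doc_id in scores}
    return {doc_id: (s - lo) / (hi - lo) for doc_id, s in scores.items()}


def hybrid_scores(
    keyword: Dict[str, float],
    vector: Dict[str, float],
    keyword_weight: float,
    vector_weight: float,
) -> Dict[str, float]:
    """Weighted arithmetic mean of the normalized keyword and vector scores."""
    total = keyword_weight + vector_weight
    if total <= 0:
        raise ValueError("at least one weight must be positive")
    kw, vec = min_max_normalize(keyword), min_max_normalize(vector)
    return {
        doc_id: (keyword_weight * kw.get(doc_id, 0.0) + vector_weight * vec.get(doc_id, 0.0)) / total
        for doc_id in set(kw) | set(vec)
    }


# Made-up raw scores: BM25-like values for keyword search, similarities for vector search.
keyword_raw = {"a": 12.0, "b": 7.0, "c": 2.0}
vector_raw = {"a": 0.2, "b": 0.9, "c": 0.5}

# Keyword-only weighting, as in Experiment 1.
combined = hybrid_scores(keyword_raw, vector_raw, keyword_weight=1, vector_weight=0)
ranking = sorted(combined, key=combined.get, reverse=True)
```

With these weights the ranking follows the keyword scores, while the values themselves are the normalized scores rather than the raw ones.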

…gchain-ai#28374) **Description:** This PR introduces a `model` alias for the embedding classes that contain the attribute `model_name`, to ensure consistency across the codebase, as suggested by a moderator in a previous PR. The change aligns the usage of attribute names across the project (see for example [here](https://github.com/langchain-ai/langchain/blob/65deeddd5dfec5d51f33ebc961f09c2e47a8f064/libs/partners/groq/langchain_groq/chat_models.py#L304)). **Issue:** This PR addresses the suggestion from the review of issue langchain-ai#28269. **Dependencies:** None --------- Co-authored-by: Eugene Yurtsev <[email protected]> Co-authored-by: Erick Friis <[email protected]>

…rkdownifyTransformer` (langchain-ai#27866)

# Description

Implements the `atransform_documents` method for `MarkdownifyTransformer` using the `asyncio` built-in library for concurrency. Note that this is mainly for API completeness when working with async frameworks rather than for performance, since the `markdownify` function is not I/O bound because it works with `Document` objects already in memory.

# Issue

Fixes langchain-ai#27865

# Dependencies

No new dependencies added, but [`markdownify`](https://github.com/matthewwithanm/python-markdownify) is required since this PR updates the `markdownify` integration.

# Tests and docs

- Tests added
- I did not modify the docstrings since they already described the basic functionality, and [the API docs also already included a description](https://python.langchain.com/api_reference/community/document_transformers/langchain_community.document_transformers.markdownify.MarkdownifyTransformer.html#langchain_community.document_transformers.markdownify.MarkdownifyTransformer.atransform_documents). If it would be helpful, I would be happy to update the docstrings and/or the API docs.

# Lint and test

- [x] format
- [x] lint
- [x] test

I ran formatting with `make format`, linting with `make lint`, and confirmed that tests pass using `make test`. Note that some unit tests pass in CI but may fail when running `make test`. Those unit tests are:

- `test_extract_html` (and `test_extract_html_async`)
- `test_strip_tags` (and `test_strip_tags_async`)
- `test_convert_tags` (and `test_convert_tags_async`)

The reason for the difference is that there are trailing spaces when the tests are run in the CI checks, and no trailing spaces when run with `make test`. I ensured that the tests pass in CI, but they may fail with `make test` due to the addition of trailing spaces.

---------

Co-authored-by: Erick Friis <[email protected]>
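
One common way to offer an async counterpart to a synchronous, in-memory transformation is to run each call in a worker thread and gather the results. The sketch below shows that pattern with the standard library only. It is an assumption about the general approach, not the code merged in this PR: `convert` is a hypothetical stand-in for the `markdownify` call, and `atransform_all` is a made-up name.

```python
import asyncio
from typing import List, Sequence


def convert(text: str) -> str:
    """Hypothetical stand-in for a synchronous, CPU-bound conversion."""
    return text.strip().upper()


async def atransform_all(texts: Sequence[str]) -> List[str]:
    # Run each synchronous conversion in a worker thread (Python 3.9+).
    tasks = [asyncio.to_thread(convert, text) for text in texts]
    # gather returns results in the same order as the inputs.
    return list(await asyncio.gather(*tasks))


results = asyncio.run(atransform_all([" first ", "second"]))
```

As the description notes, the benefit is API completeness in async code rather than speed, since the work is not I/O bound.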

**PR title**: "community: fix PDF Filter Type Error"

- **Description:** fix PDF Filter Type Error
- **Issue:** fixes langchain-ai#27153
- **Dependencies:** none
- [x] **Lint and test**: Run `make format`, `make lint` and `make test` from the root of the package(s) you've modified.

---------

Co-authored-by: Erick Friis <[email protected]>

efriis · 2024-12-14T00:03:58Z

hey there! I believe this conflicted with a change adding a filtering syntax to FAISS. Would you be interested in re-implementing just the get_by_ids function based on those filters? Otherwise will probably close for now!

Added a complete LangChain tutorial playlist from the total technology zonne channel. In this playlist, each video focuses on one specific use case with a hands-on demo. The tutorials are suitable for all levels.

---------

Co-authored-by: Erick Friis <[email protected]>

Issue: There is an ambiguity about the W&B integrations: two provider pages already exist. Fix: Added the "root" W&B provider page, with references to the documentation on the W&B site. Cleaned up the formatting of the existing pages. Added one more integration reference. --------- Co-authored-by: Erick Friis <[email protected]> Co-authored-by: Eugene Yurtsev <[email protected]>

- **Description:** Adds a helper that renders documents with the GraphVectorStore metadata fields to Graphviz for visualization. This is helpful for understanding and debugging. --------- Co-authored-by: Erick Friis <[email protected]>

- [x] **PR title**: langchain: add URL parameter to ChatDeepInfra class
- [x] **PR message**: add URL parameter to ChatDeepInfra class
- **Description:** This PR introduces a url parameter to the ChatDeepInfra class in LangChain, allowing users to specify a custom URL. Previously, the URL for the DeepInfra API was hardcoded to "https://stage.api.deepinfra.com/v1/openai/chat/completions", which caused issues when the staging endpoint was not functional. The _url method was updated to return the value from the url parameter, so the endpoint is now configurable.

---------

Co-authored-by: Erick Friis <[email protected]>

nhols · 2024-12-15T18:39:37Z

@efriis opened a new PR! #28728

dosubot bot added size:M This PR changes 30-99 lines, ignoring generated files. community Related to langchain-community Ɑ: vector store Related to vector store module labels Oct 4, 2024

prithvikannan and others added 26 commits November 21, 2024 20:03

cli: release 0.0.33 (langchain-ai#28278)

4ccb3e6

docs: poetry publish 2 (langchain-ai#28277)

9a717c9

- **docs: poetry publish** - **x** - **x** - **x** - **x** - **x** - **x** - **x** - **x** - **x**

groq,openai,mistralai: fix unit tests (langchain-ai#28279)

29f8a79

docs: poetry publish 3 (langchain-ai#28280)

65deedd

core[patch]: release 0.3.20 (langchain-ai#28293)

697dda5

ollama: include kwargs in requests (langchain-ai#28299)

7277794

courtesy of @RyanMagnuson

partners/ollama: release 0.2.2rc1 (langchain-ai#28300)

aa7fa80

community[patch]: fix errors introduced by pydantic 2.10 (langchain-a…

203d20c

…i#28297)

infra: install standard tests in docs build (langchain-ai#28303)

242e9fc

langchain[patch]: update deprecation message for MapReduceChain (lang…

25a636c

…chain-ai#28304) Link migration guide first.

infra: more rst (langchain-ai#28305)

39fd0fd

docs: integration asyncio mode (langchain-ai#28306)

a329647

[Doc] Improvement: fix import statement for qdrant (langchain-ai#28286)

ed84d48

- fix import statement for qdrant - issue: langchain-ai#28012 langchain-ai#28012

core[patch]: Compat pydantic 2.10 (langchain-ai#28308)

a813d11

pydantic 2.10 compat for langchain-core

langchain[patch]: Compat with pydantic 2.10 (langchain-ai#28307)

563587e

pydantic compat 2.10 for langchain

docs: standard test api link (langchain-ai#28309)

7170a4e

core[patch]: release 0.3.21 (langchain-ai#28314)

f5f1149

langchain[patch]: release 0.3.8 (langchain-ai#28315)

82bb0cd

community[patch]: release 0.3.8 (langchain-ai#28316)

a83357d

docs: fix GOOGLE_API_KEY typo (langchain-ai#28322)

6ed2d38

fix small GOOGLE_API_KEY markdown formatting typo

community: fixed critical bugs at Writer provider (langchain-ai#27879)

c60695a

pprados and others added 8 commits December 13, 2024 21:24

docs: dropdowns for embeddings and vector stores (langchain-ai#28713)

9c55c75

chroma[patch]: Update logic for assigning ids

b909d54

huggingface: fix standard test lint (langchain-ai#28714)

3107d78

efriis self-assigned this Dec 14, 2024

ronidas39 and others added 13 commits December 14, 2024 00:21

docs, community: aerospike docs update (langchain-ai#28717)

288f204

Co-authored-by: Jesse Schumacher <[email protected]> Co-authored-by: Jesse S <[email protected]> Co-authored-by: dylan <[email protected]>

core: release 0.3.25 (langchain-ai#28718)

387284c

infra: fix notebook tests (langchain-ai#28722)

23b433f

Bump unstructured to pick up resolution of Unstructured-IO/unstructured#3795

text-splitters[patch]: Release 0.3.3 (langchain-ai#28723)

679e3a9

langchain[patch]: Release 0.3.12 (langchain-ai#28724)

089e659

community[patch]: Release 0.3.12 (langchain-ai#28725)

a0534ae

make sure id field of Documents in FAISS docstore have the same id as…

17f4d1f

… values in index_to_docstore_id, implement get_by_ids method

assert object is Document for linting

2d609f8

Merge branch 'faiss-doc-ids' of https://github.com/nhols/langchain in…

f820c6e

…to faiss-doc-ids

nhols requested review from efriis, baskaryan and ccurme as code owners December 15, 2024 16:58

dosubot bot added size:XXL This PR changes 1000+ lines, ignoring generated files. and removed size:M This PR changes 30-99 lines, ignoring generated files. labels Dec 15, 2024

nhols closed this Dec 15, 2024
